Rachel the Robo Caller - Modeling

At DEF CON 22, the FTC ran a contest to help mitigate robocalls. There were three rounds, the last of which was using a set of call records collected from a robocall honeypot to determine if a caller was a robocaller. See Parts I and II of the contest for details on robocaller honeypots.

The FTC gave us two sets of data, that show a phone call from one "person" to another along with the date and time. Both collections have been randomized uniquely, but the portions of area code and subscriber number were kept the same.

This Notebook is a follow up to Analyzing Rachel the Robo Caller and details creating a Random Forest classifier to predict robocallers.


In [31]:
from IPython.display import Image
Image("http://www.ftc.gov/system/files/attachments/zapping-rachel/zapping-rachel-contest.jpg")


Out[31]:

Initial setup


In [6]:
%matplotlib inline
# Standard toolkits in pydata land
import pandas as pd
import numpy as np

# Exploring the use of a RandomForest
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier

In [16]:
def read_FTC(dataset):
    '''Reads the csv format that the FTC provided for the Rachel the Robocaller contest into a pandas DataFrame'''
    return pd.read_csv(dataset,
                parse_dates=["DATE/TIME"],
                converters={'LIKELY ROBOCALL': lambda val: val == 'X'},
                dtype={'TO': str, 'FROM': str, 'LIKELY ROBOCALL': bool}
    )

In [34]:
def extract_features(ftc_row):
    ftc_row["HOUR"] = ftc_row["DATE/TIME"].hour
    ftc_row["MINUTE"] = ftc_row["DATE/TIME"].minute
    
    # Extract the area code using slicing since they are all regular US numbers
    ftc_row["TO_AREA_CODE"] = ftc_row["TO"][1:4]
    ftc_row["FROM_AREA_CODE"] = ftc_row["FROM"][1:4]
    
    # Extract area code + "office code"
    ftc_row["TO_OFFICE_CODE"] = ftc_row["TO"][1:7]
    ftc_row["FROM_OFFICE_CODE"] = ftc_row["FROM"][1:7]
    
    dt = ftc_row["DATE/TIME"]
        
    ftc_row["TIMECHUNK"] = dt.hour + np.floor(4*(dt.minute/60.0))/4
    
    return ftc_row

In [35]:
def total_call_volume(df, direction="FROM"):
    sizes = df.groupby(direction).size()

    def get_size(val):
        return sizes[val]

    return df[direction].apply(get_size)

In [36]:
def massage_ftc_dataframe(ftc_dataframe):
    massaged = ftc_dataframe.apply(extract_features, axis=1)
    
    massaged["NUM_FROM_CALLS"] = total_call_volume(massaged, "FROM")
    massaged["NUM_TO_CALLS"] = total_call_volume(massaged, "TO")
    
    return massaged

In [24]:
labeled_data = read_FTC("FTC-DEFCON Data Set 1.csv")
unlabeled_data = read_FTC("FTC-DEFCON Data Set 2.csv")

In [41]:
# This assumes you have the data locally
massaged_labeled_data = massage_ftc_dataframe(labeled_data)
massaged_labeled_data.head()


Out[41]:
TO FROM DATE/TIME LIKELY ROBOCALL HOUR MINUTE TO_AREA_CODE FROM_AREA_CODE TO_OFFICE_CODE FROM_OFFICE_CODE TIMECHUNK NUM_FROM_CALLS NUM_TO_CALLS
0 17866291260 13055793696 2014-04-01 False 0 0 786 305 786629 305579 0 72 70
1 14027826713 12063339487 2014-04-01 True 0 0 402 206 402782 206333 0 55 6
2 17083187970 12246108402 2014-04-01 False 0 0 708 224 708318 224610 0 22 28
3 17733095581 13035009570 2014-04-01 True 0 0 773 303 773309 303500 0 22 11
4 19188765408 16153878533 2014-04-01 True 0 0 918 615 918876 615387 0 33 2

In [42]:
massaged_unlabeled_data = massage_ftc_dataframe(unlabeled_data)
massaged_unlabeled_data.head()


Out[42]:
TO FROM DATE/TIME LIKELY ROBOCALL HOUR MINUTE TO_AREA_CODE FROM_AREA_CODE TO_OFFICE_CODE FROM_OFFICE_CODE TIMECHUNK NUM_FROM_CALLS NUM_TO_CALLS
0 16163847430 13236069958 2014-06-01 False 0 0 616 323 616384 323606 0 7 11
1 12025176283 12029867020 2014-06-01 False 0 0 202 202 202517 202986 0 13 48
2 18663049187 15159256650 2014-06-01 False 0 0 866 515 866304 515925 0 1 1
3 15594157085 16199247140 2014-06-01 False 0 0 559 619 559415 619924 0 47 10
4 18582407865 19492012595 2014-06-01 False 0 0 858 949 858240 949201 0 1 34

In [43]:
massaged_unlabeled_data.tail()


Out[43]:
TO FROM DATE/TIME LIKELY ROBOCALL HOUR MINUTE TO_AREA_CODE FROM_AREA_CODE TO_OFFICE_CODE FROM_OFFICE_CODE TIMECHUNK NUM_FROM_CALLS NUM_TO_CALLS
201515 14435522376 14436426683 2014-06-06 23:59:00 False 23 59 443 443 443552 443642 23.75 66 98
201516 17325876492 15169325497 2014-06-06 23:59:00 False 23 59 732 516 732587 516932 23.75 425 9
201517 14159683941 12241676708 2014-06-06 23:59:00 False 23 59 415 224 415968 224167 23.75 28 193
201518 16204321022 17853507101 2014-06-06 23:59:00 False 23 59 620 785 620432 785350 23.75 16 18
201519 13475865534 15708781910 2014-06-06 23:59:00 False 23 59 347 570 347586 570878 23.75 2 2

In [44]:
massaged_labeled_data.columns


Out[44]:
Index([u'TO', u'FROM', u'DATE/TIME', u'LIKELY ROBOCALL', u'HOUR', u'MINUTE', u'TO_AREA_CODE', u'FROM_AREA_CODE', u'TO_OFFICE_CODE', u'FROM_OFFICE_CODE', u'TIMECHUNK', u'NUM_FROM_CALLS', u'NUM_TO_CALLS'], dtype='object')

Now to build our Random Forest and see how it fares


In [53]:
# Scoring system for contest
# Not 0-1 loss...
def score(our_predictions, true_results):
    '''Scoring system for the FTC contest. Not 0-1 loss.'''
    our_score = 0
    for i in range(len(true_results)):
        if (our_predictions[i] == True and true_results[i] == True):
            our_score += 1
        if (our_predictions[i] == True and true_results[i] == False):
            our_score -= 1
    return our_score


# features is only a copy of the dataframe, can't use this
#def label_encode(features, feature_name):
#    feature_encoder = preprocessing.LabelEncoder()
#    features[feature_name] = feature_encoder.fit_transform(features[feature_name])
#    return feature_encoder

def enriched_data_to_features(enriched_data):
    '''Takes a pandas DataFrame with enriched FTC data, returns features and target labels.'''
    categorical_feature_names = [

            "TO_AREA_CODE",
            "FROM_AREA_CODE",
            "TO_OFFICE_CODE",
            "FROM_OFFICE_CODE",

            #"TOTZ",
            #"FROMTZ",
            #"SAMEAREACODE",
            #"WITHIN_THREE_MINUTES",
            #"FROMVALID",
            "TIMECHUNK",
            #"ISWEEKDAY", # Undecided on whether this will generalize since
                          # training and test data have different weekdays
                          # And the labeled data is missing Mondays
    ]
    
    numerical_feature_names = ["NUM_FROM_CALLS", "NUM_TO_CALLS"]
    
    feature_names = categorical_feature_names + numerical_feature_names

    features = enriched_data[feature_names]
    
    for feature_name in categorical_feature_names:
        print("Creating categorical feature {}".format(feature_name))
        encoder = preprocessing.LabelEncoder()
        features[feature_name] = encoder.fit_transform(features[feature_name])
    
    target = enriched_data["LIKELY ROBOCALL"].values
    
    return features, target
    

def train(features, target, min_samples_split=285):
    classifier = RandomForestClassifier(n_estimators=200, 
                                        verbose=0,
                                        n_jobs=-1,
                                        min_samples_split=min_samples_split,
                                        random_state=1,
                                        oob_score=True)

    classifier.fit(features, target)
    print("Resulting OOB Score: {}".format(classifier.oob_score_))

    return classifier

In [48]:
# Separate into training and test sets based on FROM
# This won't be needed when reading in testing data set;
# for that, train on full data and then use .predict()

from_numbers = massaged_labeled_data["FROM"].unique()

# 70% / 30%
num_train = int(round(.7 * len(from_numbers)))
num_test = len(from_numbers) - num_train
train_samples = np.random.choice(from_numbers, num_train)


train_data = massaged_labeled_data[massaged_labeled_data['FROM'].isin(train_samples)]
test_data = massaged_labeled_data[~massaged_labeled_data['FROM'].isin(train_samples)]

In [58]:
# For development
print("Enriching Training Data")
train_features, train_target = enriched_data_to_features(train_data)
print("Enriching Testing Data")
test_features, test_target = enriched_data_to_features(test_data)

min_samples_split_values = np.arange(150, 290, 5)

num_parameter_trials = len(min_samples_split_values)

# create dataframe
score_frame = pd.DataFrame(index=np.arange(0, num_parameter_trials), columns=('min_samples_split', 'test_score', 'train_score') )

for trial in np.arange(0, num_parameter_trials):
    
    c = min_samples_split_values[trial]
    classifier = train(train_features, train_target, c)
    our_predictions = classifier.predict(test_features)
    our_train_predictions = classifier.predict(train_features)
    
    score_frame.loc[trial] = [c, score(our_predictions, test_target), score(our_train_predictions, train_target) ]


Enriching Training Data
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Enriching Testing Data
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Resulting OOB Score: 0.925460738398
Resulting OOB Score: 0.924730855787
Resulting OOB Score: 0.925460738398
Resulting OOB Score: 0.92317985524
Resulting OOB Score: 0.923605620096
Resulting OOB Score: 0.921400766377
Resulting OOB Score: 0.919925795268
Resulting OOB Score: 0.919500030412
Resulting OOB Score: 0.919925795268
Resulting OOB Score: 0.918967824342
Resulting OOB Score: 0.917386412019
Resulting OOB Score: 0.917720941549
Resulting OOB Score: 0.916413235205
Resulting OOB Score: 0.914938264096
Resulting OOB Score: 0.915440058391
Resulting OOB Score: 0.915151146524
Resulting OOB Score: 0.913235204671
Resulting OOB Score: 0.913797822517
Resulting OOB Score: 0.911334468706
Resulting OOB Score: 0.911516939359
Resulting OOB Score: 0.911668998236
Resulting OOB Score: 0.911349674594
Resulting OOB Score: 0.91071102731
Resulting OOB Score: 0.90897755611
Resulting OOB Score: 0.908947144334
Resulting OOB Score: 0.906924761268
Resulting OOB Score: 0.90604281978
Resulting OOB Score: 0.906894349492

In [59]:
score_frame


Out[59]:
min_samples_split test_score train_score
0 150 4578 17775
1 155 4734 17659
2 160 4651 17687
3 165 4691 17573
4 170 4693 17554
5 175 4671 17403
6 180 4666 17343
7 185 4711 17326
8 190 4635 17274
9 195 4828 17244
10 200 4626 17143
11 205 4689 17149
12 210 4929 17074
13 215 4688 16988
14 220 4686 16975
15 225 4855 16924
16 230 4641 16907
17 235 4753 16885
18 240 4661 16701
19 245 4731 16745
20 250 4778 16702
21 255 4776 16614
22 260 4918 16619
23 265 4897 16552
24 270 4982 16482
25 275 5000 16370
26 280 4798 16289
27 285 4816 16360

In [64]:
# For the sake of the contest now, we'll train on the entire FTC1 dataset and then fit to the FTC2 dataset

train_data = massaged_labeled_data
test_data = massaged_unlabeled_data

train_features, train_target = enriched_data_to_features(train_data)

test_features, _ = enriched_data_to_features(test_data)

c = 285 # Determined during contest, not sure of now that I've run through it again
classifier = train(train_features, train_target, c)
predictions = classifier.predict(test_features)

contest_results = unlabeled_data[["FROM", "TO", "DATE/TIME"]]

contest_results["LIKELY ROBOCALL"] = predictions
contest_results["LIKELY ROBOCALL"] = X["LIKELY ROBOCALL"].map(lambda x: "X" if x else "")
contest_results.to_csv("predictions.csv", index=False)


Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Creating categorical feature TO_AREA_CODE
Creating categorical feature FROM_AREA_CODE
Creating categorical feature TO_OFFICE_CODE
Creating categorical feature FROM_OFFICE_CODE
Creating categorical feature TIMECHUNK
Resulting OOB Score: 0.910907595204

In [65]:
!ls


Analyzing Rachel the Robo Caller.ipynb	Modeling Rachel the Robo Caller.ipynb
Dockerfile				predictions.csv
enrich.ipynb				rachel.py
FTC-DEFCON Data Set 1.csv		README.md
FTC-DEFCON Data Set 2.csv		requirements.txt
LICENSE					Untitled0.ipynb

In [ ]: